Genome-resolved metagenomics on anaerobic enrichments from deep subsurface groundwaters
Author
george.westmeijer@umu.se
Published
November 29, 2024
Read data and assign colors
The 16S ribosomal RNA data is from several sequencing projects. In chronological order, P15009 contains the groundwater samples from t0 (sampled Oct 2019), while P21015 contains the samples from t24 (sampled Feb 2020) from identical groundwaters. P21015 contains the sequencing libraries from the cultures as part of phase a (one meteoric groundwater), while P26201 contains the libraries from the cultures as part of phase b (three groundwaters as inoculum).
All 16S libraries were processed with the bioinformatic pipeline nf-core/ampliseq (v2.12.0), using the SBDI-GTDB (R09-RS220-1) as a reference.
rare %>%inner_join(smd, by =c("site"="sample")) %>%# One sample has ~ 1 million reads filter(sample <150000) %>%ggplot(aes(sample, species, group = site, colour = origin)) +geom_line(linewidth =0.75) +scale_color_manual(values = cols.origin, name ="Groundwater", labels =c("Meteoric", "Marine", "Saline")) +labs(x ="No. of sequenced reads", y ="Rarefied No. of ASVs") +scale_x_continuous(labels =c("0", "50.000", "100.000", "150.000")) +theme_tidy()
Figure 2: Rarefaction curves for the groundwater samples (t0) for each groundwater type (n = 6).
Overview community structure t0
To visualize the community composition, the most abundant phyla are identified while grouping less abundant phyla as “Other”. ASVs that are not identified on phylum level are included as “Unidentified”.
Expand code
t <- seqtab %>%filter(sample %in% t0) %>%inner_join(tax, by ="seqid") %>%group_by(phylum, sample) %>%# Sum the abundance of each phylum within a samplesummarise(relab =sum(relab), .groups ='drop_last') %>%# Calculate the mean abundance of each phylum over the categoriessummarise(mean_relab =sum(relab), .groups ='drop') %>%filter(!is.na(phylum)) %>%top_n(8, mean_relab)t %>%arrange(desc(mean_relab)) %>% knitr::kable()
As the hydrochemistry data is from a monitoring program, the sampling data closest to groundwater sampling was selected. All units are mg l-1, except for EC (mS m-1), pH, T (degrees C), and O18 (ppt SMOW)
The number of ASVs per sample are estimated with the specnumber() function from the Vegan library.
Expand code
adiv <- seqtab %>%select(-relab) %>%pivot_wider(names_from ="seqid", values_from ="count", values_fill =0) %>%column_to_rownames('sample') %>% vegan::specnumber() %>%as.data.frame() %>%rownames_to_column('sample') %>%filter(sample %in% t0) %>%rename(specnumber =2)chem <- adiv %>%# Add the species richness for each sampleinner_join(chem, by ="sample")
In order to run the RDA, the absolute counts are standardized using a Hellinger transformation (square root of the relative abundance).
Expand code
hellinger <- seqtab %>%filter(sample %in% t0) %>%# Standard Vegan transformation: Turn table with with samples as *rows*select(sample, seqid, count) %>%pivot_wider(names_from ="seqid", values_from ="count", values_fill =0) %>%# Turn into a numeric matrixcolumn_to_rownames('sample') %>% vegan::decostand(method ="hellinger") %>%rownames_to_column("sample") %>%pivot_longer(-1, names_to ="seqid", values_to ="hellinger")m <- hellinger %>%# Create a matrixspread(seqid, hellinger) %>%column_to_rownames("sample")
The RDA is called while providing a large number of constraining factors, as the RDA function will by default not include redundant variables (Pearson r > 0.9).
Some constraints or conditions were aliased because they were redundant. This
can happen if terms are linearly dependent (collinear): 'ph_lab', 'ec_lab',
'temp_w', 'nh4_n', 'feii'
Expand code
# This one will contain the proportions explained which we get by dividing the# eigenvalue of each component with the sum of eigenvalues for all components.p <-c(rda$CCA$eig, rda$CA$eig)/sum(c(rda$CCA$eig, rda$CA$eig))# This one is collecting the coordinates for arrows that will depict how the # factors in our model point in the coordinatea <- rda$CCA$biplot %>%data.frame() %>% tibble::rownames_to_column('factor')lab <-tibble(factor =c("o18", "doc"),label =c("delta^18*O", "DOC")) %>%inner_join(a, by ="factor")summary(rda)
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
ggsave("figures/fig-1.pdf", width =18, height =10, units ="cm")
Network analysis: phylum
Expand code
t <-c("Spirochaetota", "Acidobacteriota", "Atribacterota", "Chloroflexota","Omnitrophota", "Nanoarchaeota", "Campylobacterota", "Pseudomonadota")network.survey <-read_tsv("data/network-survey.tsv", show_col_types = F)# Correlation between the abundance of the phyla using the culturesfit <-tibble("phylum"= t, "r.squared"=NA)for (i in t) { model <-lm(relab ~ cpr, data = network.survey %>%filter(phylum == i)) %>%summary() fit[fit$phylum == i, "r.squared"] <- model$adj.r.squared %>%round(digits =2)}fit
set.seed(999)nmds <- seqtab %>%# Full size fractioninner_join(smd, by ="sample") %>%filter(!sample %in%paste("P21015_10", seq(42,81), sep ="")) %>%# Data from bachelor projectfilter(fraction =="non-fractionated"| fraction =="environmental") %>%group_by(seqid) %>%filter(count >2) %>%ungroup() %>%select(seqid, sample, count) %>%spread(seqid, count, fill =0) %>%column_to_rownames("sample") %>% vegan::metaMDS()
Square root transformation
Wisconsin double standardization
Run 0 stress 0.211602
Run 1 stress 0.2345592
Run 2 stress 0.2112576
... New best solution
... Procrustes: rmse 0.008926961 max resid 0.03646891
Run 3 stress 0.212245
Run 4 stress 0.2286573
Run 5 stress 0.211117
... New best solution
... Procrustes: rmse 0.00353371 max resid 0.02077429
Run 6 stress 0.2128131
Run 7 stress 0.2291371
Run 8 stress 0.211117
... New best solution
... Procrustes: rmse 1.543506e-05 max resid 8.833604e-05
... Similar to previous best
Run 9 stress 0.4051623
Run 10 stress 0.2423527
Run 11 stress 0.2308933
Run 12 stress 0.212245
Run 13 stress 0.2461051
Run 14 stress 0.2462855
Run 15 stress 0.2210886
Run 16 stress 0.2335761
Run 17 stress 0.2339457
Run 18 stress 0.211117
... Procrustes: rmse 1.554376e-05 max resid 6.999867e-05
... Similar to previous best
Run 19 stress 0.2291882
Run 20 stress 0.2318306
*** Best solution repeated 2 times
Expand code
vegan::scores(nmds)$sites %>%as.data.frame() %>%rownames_to_column("sample") %>%inner_join(smd, by ="sample") -> nmds.scores
nmds <- seqtab %>%# Small size fractioninner_join(smd, by ="sample") %>%filter(!sample %in%paste("P21015_10", seq(42,81), sep ="")) %>%# Data from bachelor projectfilter(fraction =="0.1 - 0.45 µm"| fraction =="Environmental") %>%group_by(seqid) %>%filter(count >2) %>%ungroup() %>%select(seqid, sample, count) %>%spread(seqid, count, fill =0) %>%column_to_rownames("sample") %>% vegan::metaMDS()
Square root transformation
Wisconsin double standardization
Run 0 stress 0.2081801
Run 1 stress 0.1495638
... New best solution
... Procrustes: rmse 0.137786 max resid 0.4403868
Run 2 stress 0.1495638
... New best solution
... Procrustes: rmse 1.671599e-05 max resid 4.648518e-05
... Similar to previous best
Run 3 stress 0.1495638
... Procrustes: rmse 7.217654e-05 max resid 0.000230497
... Similar to previous best
Run 4 stress 0.197663
Run 5 stress 0.1858236
Run 6 stress 0.1495638
... Procrustes: rmse 5.761408e-05 max resid 0.0001885813
... Similar to previous best
Run 7 stress 0.1696327
Run 8 stress 0.1933782
Run 9 stress 0.1696589
Run 10 stress 0.1495638
... New best solution
... Procrustes: rmse 1.230415e-05 max resid 3.972137e-05
... Similar to previous best
Run 11 stress 0.1495638
... Procrustes: rmse 1.771555e-05 max resid 6.448628e-05
... Similar to previous best
Run 12 stress 0.1496035
... Procrustes: rmse 0.002179419 max resid 0.009601999
... Similar to previous best
Run 13 stress 0.1496035
... Procrustes: rmse 0.002179319 max resid 0.009588255
... Similar to previous best
Run 14 stress 0.1858235
Run 15 stress 0.1696587
Run 16 stress 0.1496035
... Procrustes: rmse 0.002166634 max resid 0.009522663
... Similar to previous best
Run 17 stress 0.1495638
... Procrustes: rmse 2.06804e-05 max resid 7.56214e-05
... Similar to previous best
Run 18 stress 0.1864292
Run 19 stress 0.1495638
... Procrustes: rmse 1.367408e-05 max resid 4.399976e-05
... Similar to previous best
Run 20 stress 0.1495638
... Procrustes: rmse 5.688851e-05 max resid 0.0001883812
... Similar to previous best
*** Best solution repeated 8 times
Expand code
vegan::scores(nmds)$sites %>%as.data.frame() %>%rownames_to_column("sample") %>%inner_join(smd, by ="sample") -> nmds.scores
Square root transformation
Wisconsin double standardization
Run 0 stress 0.1203684
Run 1 stress 0.1448066
Run 2 stress 0.1374292
Run 3 stress 0.1212802
Run 4 stress 0.1331208
Run 5 stress 0.1230642
Run 6 stress 0.13485
Run 7 stress 0.14842
Run 8 stress 0.133246
Run 9 stress 0.1408211
Run 10 stress 0.1212186
Run 11 stress 0.1440819
Run 12 stress 0.1336567
Run 13 stress 0.1400868
Run 14 stress 0.1447862
Run 15 stress 0.1504146
Run 16 stress 0.1212186
Run 17 stress 0.1342165
Run 18 stress 0.1305863
Run 19 stress 0.133246
Run 20 stress 0.1476626
*** Best solution was not repeated -- monoMDS stopping criteria:
14: stress ratio > sratmax
6: scale factor of the gradient < sfgrmin
Expand code
nmds.scores <- vegan::scores(nmds)$sites %>%as.data.frame() %>%rownames_to_column("sample") %>%inner_join(smd, by ="sample")
ggsave("figures/fig-3.pdf", width =14, height =9, units ="cm")
Microscopy
Expand code
growth <-read_tsv("data/micro_cultures.tsv", col_types =cols(.default =col_character(), count =col_integer(),date =col_date(format ="%d-%m-%Y")) ) %>%mutate(week =case_when( date =="2021-10-15"~1, date =="2021-11-11"~5, date =="2021-11-25"~7, date =="2021-12-09"~9, date =="2021-12-30"~12, date =="2022-01-18"~15, date =="2022-02-03"~17 )) %>%filter(!sample %in%c("P26201_1051","P26201_1024")) %>%mutate(label =case_when( groundwater =="KR0015"~"Meteoric", groundwater =="SA1420"~"Marine", groundwater =="SA2600"~"Saline" ))counts <-tibble(label =c("Meteoric","Meteoric","Marine","Marine","Saline","Saline"),n =c(680220, 890910, 59220, 73910, 14100, 23000)) %>%group_by(label) %>%summarise(mean =mean(n), sd =sd(n))
Expand code
# Microscopy counts on KR0015 groundwater in duplicates, obtaining 680,220 and 890,910 cells ml-1.# Microscopy counts on SA1420 groundwater in duplicates, obtaining 59220 and 73910 cells ml-1.# For SA2600, this number was 14,100 and 23,000 cells ml-1growth %>%ggplot(aes(x = week, y = count, colour = fraction)) +geom_point() +geom_line(aes(linetype = medium)) +facet_wrap(~ label, ncol =3) +geom_hline(data = counts, aes(yintercept = mean)) +scale_y_log10(limits =c(1e4,1e8),breaks =c(1e4,1e5,1e6,1e7),labels =c(expression("10"^"4"),expression("10"^"5"),expression("10"^"6"),expression("10"^"7")) ) +scale_x_continuous(breaks =c(1,3,5,7,9,11,13,15,17)) +scale_colour_manual(values =c("unfiltered"="#1B5331", "filtered"="#A2A475"),labels =c("0.1 - 0.45 µm","All cells")) +scale_linetype(name ="Medium:", labels =c("Acetate", "Cell lysate")) +labs(x ="", y =expression("Cell number (mL"^"-1"*")"), colour ="Fraction:") +theme_tidy(legend ="bottom") +theme(legend.margin =margin(t =-10),plot.margin =margin(),panel.border =element_blank(),axis.line =element_line(linewidth =0.5, colour ="black") )
Figure 3: Cell counts of enrichment cultures according to fluorescence microscopy. Each line represents the cell density of a single culture and the black horizontal line denotes the mean cell count of the groundwater at t0 (n = 2).
Expand code
ggsave(filename ="figures/fig-s2.pdf", width =18, height =9, units ="cm")
ggsave(filename ="figures/fig-2.pdf", width =12, height =8, units ="cm")
Areaplot KR0015
Expand code
t <- seqtab %>%filter(sample %in%paste("P21015", seq(1042,1081), sep ="_") ) %>%inner_join(tax, by ="seqid") %>%group_by(phylum, sample) %>%# Sum the abundance of each phylum within a samplesummarise(relab =sum(relab), .groups ='drop_last') %>%# Calculate the mean abundance of each phylum over the categoriessummarise(mean_relab =sum(relab), .groups ='drop') %>%filter(!is.na(phylum)) %>%top_n(9, mean_relab) %>%filter(phylum !="Bacteroidota")# The most abundant phyla, averaged over all culturest %>%arrange(desc(mean_relab))
seqtab %>%filter(sample %in%paste("P21015", seq(1042,1081), sep ="_") ) %>%inner_join(taxref, by ="seqid") %>%# Summarize in order to have the sum for each category and topphylumgroup_by(sample, topphylum) %>%summarise(relab =sum(relab), .groups ='drop') %>%inner_join(smd, by ="sample") %>%mutate(age =as.numeric(.$age)) %>%mutate(fraction =case_when( fraction =="non-fractionated"& medium =="ace"~"Acetate, all cells", fraction =="non-fractionated"& medium =="lys"~"Lysate, all cells", fraction =="fractionated"& medium =="ace"~"Acetate, < 0.45 µm fraction", fraction =="fractionated"& medium =="lys"~"Lysate, < 0.45 µm fraction" )) %>%mutate(topphylum =factor(topphylum, levels =rev(names(cols.phylum)))) %>%# Call the plotggplot(aes( age, relab *100, fill = topphylum)) +geom_col() +scale_x_continuous(n.breaks =10) +facet_wrap(~ fraction, nrow =1) +labs(x ='Week', y ='Relative abundance (%)', fill ='') +scale_fill_manual(values = cols.phylum) +guides(fill =guide_legend(reverse =TRUE, ncol =3)) +theme_tidy(ratio =1.2, legend ="bottom") +theme(panel.border =element_rect(fill ="transparent", linewidth =0.5),strip.background =element_blank(), legend.background =element_blank(),legend.key.height =unit(4, "mm"),legend.key.spacing.y =unit(0, "mm"),plot.margin =margin(),axis.ticks.length =unit(0.5, "mm"),legend.box.margin =margin(t =-5) )
Expand code
ggsave("figures/fig-s3.png", height =9, width =18, units ="cm")
Genome-resolved metagenomics
Basic stats on the MAGs
Expand code
t <-c("Bacillota","Desulfobacterota","Pseudomonadota","Patescibacteria","Nanoarchaeota","Other")# Prepare data for ploti <- quality %>%inner_join(bintax, by ="genome") %>%mutate(phylum =ifelse(phylum %in% t, phylum, "Other")) %>%mutate(phylum =factor(phylum, levels = t)) %>%select(genome, phylum, completeness, gc_content, genome_size, coding_density) %>%distinct() %>%mutate(genome.size = genome_size /1e6)
Expand code
lm(gc_content ~ genome_size, data = quality) %>%summary()
Call:
lm(formula = gc_content ~ genome_size, data = quality)
Residuals:
Min 1Q Median 3Q Max
-0.15655 -0.06232 -0.02604 0.07979 0.13851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.058e-01 3.670e-02 8.332 1.26e-09 ***
genome_size 5.080e-08 1.029e-08 4.935 2.23e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09515 on 33 degrees of freedom
Multiple R-squared: 0.4246, Adjusted R-squared: 0.4072
F-statistic: 24.36 on 1 and 33 DF, p-value: 2.233e-05
Expand code
p1 <- i %>%ggplot(aes( genome.size, gc_content, color = phylum,fill = phylum )) +geom_smooth(method ="lm", formula = y ~ x, colour ="black", fill ="#d9d9d9", se =TRUE, linewidth =0.4) +geom_point(size =2.5, shape =21, stroke =0.5) +labs(x ="Estimated genome size (Mbp)", y ="Estimated GC content (%)", color ="") +scale_colour_manual(values = cols.mag) +scale_fill_manual(values =alpha(cols.mag, alpha =0.5), guide ="none") +theme_tidy() +scale_x_continuous(breaks =c(1,2,3,4,5,6)) +annotate("text", x =Inf, y =-Inf, label ="paste(R ^ 2, \" = 0.41\")", size =7/.pt, vjust =-0.5, hjust =1, parse =TRUE) +theme(axis.line =element_line(),panel.border =element_blank() )
Expand code
p2 <- i %>%ggplot(aes( phylum, completeness, group = phylum, fill = phylum, color = phylum )) +geom_boxplot(linewidth =0.5, outlier.colour ="white") +geom_jitter(width =0.2, size =2, shape =21, stroke =0.5) +labs(x ="", y ="Estimated completeness (%)", fill ="") +scale_fill_manual(values =alpha(cols.mag, alpha =0.5), guide ="none") +scale_color_manual(values = cols.mag, guide ="none") +theme_tidy(legend ="bottom") +theme(axis.line =element_line(),panel.border =element_blank(),axis.text.x =element_blank(),axis.ticks.x =element_blank(),axis.title.y =element_text(margin =margin()),plot.margin =margin() )
Expand code
lm(coding_density ~ genome_size, data = quality) %>%summary()
Call:
lm(formula = coding_density ~ genome_size, data = quality)
Residuals:
Min 1Q Median 3Q Max
-0.05556 -0.01303 -0.00046 0.01275 0.03787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.172e-01 8.112e-03 113.064 < 2e-16 ***
genome_size -6.991e-09 2.275e-09 -3.072 0.00424 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02103 on 33 degrees of freedom
Multiple R-squared: 0.2224, Adjusted R-squared: 0.1989
F-statistic: 9.439 on 1 and 33 DF, p-value: 0.004236
Expand code
p3 <- i %>%ggplot(aes( genome.size, coding_density, color = phylum,fill = phylum )) +geom_smooth(method ="lm", formula = y ~ x, colour ="black", fill ="#d9d9d9", se =TRUE, linewidth =0.4) +geom_point(size =2.5, shape =21, stroke =0.5) +labs(x ="Estimated genome size (Mbp)", y ="Estimated coding density", color ="") +scale_colour_manual(values = cols.mag, guide ="none") +scale_fill_manual(values =alpha(cols.mag, alpha =0.5), guide ="none") +theme_tidy() +scale_x_continuous(breaks =c(1,2,3,4,5,6)) +scale_y_continuous(n.breaks =7) +annotate("text", x =Inf, y =Inf, label ="paste(R ^ 2, \" = 0.20\")", size =7/.pt, vjust =1, hjust =1, parse =TRUE) +theme(axis.line =element_line(),panel.border =element_blank() )
Expand code
nmds <- coverm %>%# Sum the abundance of both coverage per metagenomemutate(abundance =1) %>%inner_join(emapper, by ="genome", relationship ="many-to-many") %>%# Filter out NA and add multiple KOs on new linefilter(KEGG_ko !="-") %>%separate_rows(KEGG_ko, sep =",") %>%# Sum the abundance of each KO within the metagenomesgroup_by(genome, KEGG_ko) %>%summarise(abundance =sum(abundance), .groups ="drop") %>%# Pivot to prepare a numerical matrixpivot_wider(names_from ="KEGG_ko", values_from ="abundance", values_fill =0) %>%column_to_rownames("genome") %>% vegan::metaMDS()
Square root transformation
Wisconsin double standardization
Run 0 stress 0.06277184
Run 1 stress 0.06384113
Run 2 stress 0.06856721
Run 3 stress 0.06286035
... Procrustes: rmse 0.07810586 max resid 0.1485213
Run 4 stress 0.06368732
Run 5 stress 0.06286033
... Procrustes: rmse 0.07811342 max resid 0.1485397
Run 6 stress 0.06362823
Run 7 stress 0.06902555
Run 8 stress 0.06610897
Run 9 stress 0.0679945
Run 10 stress 0.06384089
Run 11 stress 0.06299223
... Procrustes: rmse 0.07837556 max resid 0.1496081
Run 12 stress 0.06299208
... Procrustes: rmse 0.07823435 max resid 0.1492526
Run 13 stress 0.06386464
Run 14 stress 0.07029334
Run 15 stress 0.06796592
Run 16 stress 0.06500024
Run 17 stress 0.1937571
Run 18 stress 0.06459237
Run 19 stress 0.06660911
Run 20 stress 0.06299231
... Procrustes: rmse 0.0783978 max resid 0.1496653
*** Best solution was not repeated -- monoMDS stopping criteria:
2: no. of iterations >= maxit
18: stress ratio > sratmax